Automated Evaluation for Student Argumentative Writing: A Survey

Xinyu Wang

Riiid Labs

Yohan Lee

Riiid Labs

Juneyoung Park

Riiid AI Research

xinyuwang699@gmail.com yohan.lee@riiid.co juneyoung.park@riiid.co

Abstract

arXiv:2205.04083v1 [cs.CL] 9 May 2022

This paper surveys and organizes research works in an under-studied area, which we call automated evaluation for student argumen- tative writing. Unlike traditional automated writing evaluation that focuses on holistic es- say scoring, this field is more specific: it fo- cuses on evaluating argumentative essays and offers specific feedback, including argumenta- tion structures, argument strength trait score, etc. The focused and detailed evaluation is useful for helping students acquire important argumentation skill. In this paper we organize existing works around tasks, data and methods. We further experiment with BERT on represen- tative datasets, aiming to provide up-to-date baselines for this field.

Introduction
Automated writing evaluation uses computer pro- grams to give evaluative feedback to a piece of written work, which is often used in educational settings (Parra et al., 2019). Most of the work on automated writing evaluation focus on automated essay scoring (AES), which evaluates an essay’s quality by assigning scores (Shermis and Burstein, 2013). It is a long-standing research problem, with the first system proposed in 1966 by Page (1966). Since then this area has been active and attracts a lot of research efforts.
A large portion of AES systems are developed for holistic scoring (Ke and Ng, 2019), which out- puts a single score to represent essay quality. This kind of system is useful for summative assess- ment and can greatly reduce manual grading ef- forts. However, holistic scoring is not sufficient for providing instructional feedback to student learn- ing because a low holistic score does not provide enough information to the students about how to im- prove. To address this issue, many research works tried to build trait-specific scoring systems. These systems concern scoring particular quality dimen- sions of an essay, such as grammar (Burstein et al.,
2004), word choice (Mathias and Bhattacharyya, 2020), coherence (Somasundaran et al., 2014), etc. Holistic scoring systems, along with trait specific scoring systems, have already been deployed to commercial settings successfully (Burstein et al. 2004, Rudner et al. 2006).
However, most of the systems do not distinguish between essay types (e.g. argumentative or narra- tive essay). It makes sense when the system is trying to evaluate type-agnostic dimensions such as word choice of an essay, but the use of such systems is limited when an user wants to know more about the in-depth traits specific to an essay type, e.g. whether claims and counterclaims are developed fairly in an argumentative essay. The evaluation of type specific traits is important because these traits reflect important type-specific skills — for example, whether claims and counterclaims are de- veloped thoroughly is indicative of critical thinking skills, which can be hard to gauge through narrative essay writing.
In this paper we attempt to organize the research works regarding automated evaluation specifically for student argumentative writing, which requires the students to evaluate controversial claims, col- lect and judge evidence and establish a position. The contributions of this paper are as follows. First, we categorized and organized current research body regarding tasks and datasets (section 2) and meth- ods (section 3). Additionally, we experimented with BERT (Devlin et al., 2019) models and pro- vided up-to-date baselines to the community (sec- tion 4). Finally, we suggested directions (section 5) based on missing elements in current research body which we hope to be filled in the future.

Tasks and Datasets

In this section, we introduce the tasks commonly studied in automated writing evaluation for argu- mentative writing, together with their benchmark datasets.

Major Claim 1

For

Claim 1

Claim 2

Support

Premise1

Premise2

Support

Premise3

Premise4

Premise5

Some people argue for and others against and there is still no agreement whether cloning technology should be permitted. However, as far as I’m concerned, [cloning is an important technology for humankind]MajorClaim1 since [it would be very useful for developing novel cures]Claim1 .

First, [cloning will be beneficial for many people who are in need of organ transplants]Claim2 . [Cloned organs will match perfectly to the blood group and tissue of patients]Premise1 since [they can be raised from cloned stem cells of the patient]Premise2 . In addition, [it shortens the healing process]Premise3 . Usually, [it is very rare to find an appropriate organ donor]Premise4 and [by using cloning in order to raise required organs the waiting time can be shortened tremendously]Premise5 .

Figure 1: An illustration of annotation scheme for S&G2014 and S&G2017a. This figure is adapted from Stab and Gurevych (2017a)

Type	Example Tasks
AM	Argument Component Identification Argument Component Classification Argument Relation Identification Relation Labelling
CD	Opposing Argument Detection Valid Critique Detection Thesis Detection
QA	Sufficiency Recognition Argument Strength Scoring

Table 1: Example sub-tasks under the three main prob- lems: argument mining (AM), component detection (CD) and quality assessment (QA)

Argument Mining

Argument mining (AM) aims to identify and parse the argumentation structure of a piece of text. (Lawrence and Reed, 2020) For argumentative es- says, argumentation structures may have variations but can typically be represented as trees, as illus- trated in Figure 1. The root of the tree is a major claim, which expresses the author’s main point on the topic. Children of the major claims are claims, which are controversial statements that either ar- gue for or against their parents. Claims could have children nodes, called premises (e.g. Premise1 and Premise3 on Figure 1), which support or at- tack the corresponding claims. Further, a premise (e.g. Premise3) can also be supported or attacked by other premises (e.g. Premise4 and Premise5), enriching the logical flow of the essay.

In order to generate an argumentation structure

from text, argument mining models typically de- compose the process into four sub-tasks: (1) ar- gument component identification (ACI), which aims to identify the span of argument components (e.g. text with squared brackets on Figure 1); (2) argument component classification (ACC), which classifies argument components into corresponding types (e.g. where a span is a Claim or Premise);

3) argument relation identification (ARI), which

aims at linking related argument components on the argument structure; 4) relation labelling (RL), which focuses on identifying relation type (e.g. for and support) between linked components.

Datasets Stab and Gurevych, 2014a (S&G2014) and Stab and Gurevych, 2017a (S&G2017a) anno- tated the most widely adopted datasets regarding this direction. S&G2014 annotated 90 persuasive essays from essayforum, a site providing feedback for students who wish to improve their writing. S&G2017a adopted similar annotation scheme and annotated additional 402 essays.

Besides, Putra et al. (2021a) annotated 434 es- says written by English to Speakers of Other Lan- guages (ESOL) college learners as quasi-trees. They annotated each sentence as either argumenta- tive component (ACs) or non-argumentative com- ponent (non-AC). Annotators then identified re- lations between ACs. Different from S&G2014, their relation labels include directed labels (sup- port, attack or detailing) as well as an undirected la- bel(restatement). They also reordered the sentences to make the essays better-structured. Wambsganss et al. (2020) annotated 1000 student peer-reviews

Data	No. of Essays	Label Structure	Task Type
Stab and Gurevych 2014a	90	Tree	Argument mining
Stab and Gurevych 2017a	402	Tree	Argument mining
Stab and Gurevych 2016	402	Binary value	Component Detection
Carlile et al. 2018	102	Values on top of tree nodes	Quality Assessment
Stab and Gurevych 2017b	402	Binary value	Quality Assessment
Persing and Ng 2015	1000	Score	Quality Assessment

Table 2: Comparison of several popular public datasets

written in German, indicating whether a text span is claim or premise or not, and the relations between these argumentative components.

Alhindi and Ghosh (2021) annotated on a token level using BIO tagging schema and this dataset can be used for ACI and ACC. Specifically, each token belongs to one of the five classes: the begin token of a claim; a continuous token of a claim; the begin token of a premise; a continuous token of a premise; non-argumentative token.

Component Detection

Besides parsing the complex argumentative struc- tures of essays as in argument mining, a large por- tion of work in automated writing evaluation also considers the simpler task of detecting whether an essay or a text span in an essay contains a spe- cific component. For example, Stab and Gurevych (2016) assigned binary labels to essays depend- ing on whether an essay has discussed the oppos- ing arguments to author’s own standpoint. Falak- masir et al. (2014) labelled essays as whether or not containing a thesis and conclusion statement. (Beigman Klebanov et al., 2017) annotated essays written by college students for criticizing a piece of argument, deciding if each sentence contains a good critique or not. (Ghosh et al., 2020) annotated on whether a sentence contains a valid critique as well, but the essays they used are written by middle school students.
Finally, we remark that component detection is different from the task of argument component clas- sification (ACC) discussed previously in Section
2.1. Specifically, component detection does not require the text span to match the component of in- terest. For example, Falakmasir et al. (2014) cares about whether an essay contains a thesis and con- clusion statement. In their case, the length of text span is much longer than the component of interest.
Quality Assessment
Quality assessment concerns evaluating the quality of an argument. There are many sub-tasks under this category due to the diversity of argumentation quality theories.
Persing and Ng (2015) annotated 1000 essays over 10 prompts from International Corpus of Learner English (ICLE) dataset (Granger et al., 2009) on argument strength. They defined argu- ment strength as how well an essay makes an argu- ment for its thesis and convinces its readers. Hor- bach et al. (2017) collected 2020 German essays written by prospective university students. Their annotators evaluated the text regarding 41 aspects including quality of argumentation. In addition, Stab and Gurevych (2017b) (S&G2017b) adopted Relevance-Acceptability-Sufficiency criteria (John- son and Blair, 2006) and asked annotators to de- cide whether a piece of argument is sufficiently supported or not.
Besides, Carlile et al. (2018) annotated detailed persuasiveness and attributes values based on argu- ment trees from S&G2014 and S&G2017a. They defined an argument as a node in the argument tree along with all its children. For each argument, they assigned an overall persuasive score, common attribute values and type specific attribute values. For example, for arguments using a major claim as the root node, annotations include persuasive score, eloquence, specificity, evidence (common attributes) and persuasive strategies (type-specific attribute).

Methods

All the methods to our knowledge are learning based and they can be divided into supervised learn- ing and unsupervised learning. Most of the work use supervised learning, which can be further di- vided into feature-based approaches and neural ap- proaches. Next, we introduce these approaches in more details.

Feature-based method
For feature-based approaches, off-the-shelf algo- rithms are typically used for model training on hand-crafted input features. For example, support vector machine (SVM), logistic regression and ran- dom forest are typically used for tasks that can be framed as classification (Stab and Gurevych, 2014b, Stab and Gurevych, 2017b, Stab and Gurevych, 2017a, Persing and Ng, 2016, Beigman Klebanov et al., 2017, Wan et al., 2021). Linear regression and support vector machine regression are often used for task that can be framed as regression (Pers- ing and Ng, 2015, Wachsmuth et al., 2016, Persing and Ng, 2015).
Next, we introduce more details of common fea- tures used in these methods.
Lexical features aim to capture word-level in- formation and common lexical features include n- gram and frequent words. They have been shown effective (Stab and Gurevych, 2014b, Beigman Kle- banov et al., 2017, Stab and Gurevych, 2017b). However, they do not perform as well when used in a cross-prompt setting (Beigman Klebanov et al., 2017) where prompts of testing essays are not seen during training. This is intuitive as the actual word- ing of essays of different prompts would differ sig- nificantly.
Syntactic features usually rely on parse trees. Common syntactic features include number of sub- clauses in a parse tree, Boolean indicator of produc- tion rules, part-of-speech tags, etc. Additionally, basic information such as tense of verbs, presence of modal verbs can also serve as syntactic features. (Stab and Gurevych, 2017a) showed that syntactic features are useful for identifying the beginning of an argument component and (Stab and Gurevych, 2017b) suggested that syntactic features are effec- tive for recognizing insufficiently supported argu- ments.
Structural features generally describe the po- sition and frequency of a piece of text. For exam- ple, they include position of a token, punctuation and an argument component. They also include statistics such as number of tokens in an argument component. (Stab and Gurevych, 2017a) reported effectiveness for these features on both ACI and ACC tasks.
Embedding features are based on word vectors that represent words in a continuous space and are supposed to capture more information than simple n-grams. Stab and Gurevych (2017a) summed the
word2vec (Mikolov et al., 2013) vectors for each token to represent a component. Putra et al. (2021b) used BERT (Devlin et al., 2019) to extract token embeddings in their work.
Discourse features captures how sentences or clauses are connected together. One kind of dis- course features depend on discourse markers di- rectly. The markers, such as "therefore", suggest the relationship between current text span and its adjacent text span. Another kind of discourse fea- tures use the output of discourse parsers. For ex- ample, Beigman Klebanov et al. (2017) parsed sen- tences into corresponding discourse roles, and then used these discourse roles as features. Stab and Gurevych (2017a) reported usefulness of discourse features on classifying argument components, indi- cating a correlation between general discourse rela- tion and argument component type. Beigman Kle- banov et al. (2017) found that discourse features remain useful in cross-prompt settings, which is valuable as it’s not always possible to collect a lot of data for a single prompt.
Neural method
As for neural approaches, neural architectures such as long short-term memory (LSTM) networks and convolutional neural networks (CNN) are com- monly adopted (Eger et al., 2017, Alhindi and Ghosh, 2021, Putra et al., 2021b, Mim et al., 2019a, Xue and Lynch, 2020,Stab and Gurevych, 2017b). Besides, Transformer (Vaswani et al., 2017) based architectures such as BERT (Devlin et al., 2019) are adopted recently (Ye and Teufel, 2021, Putra et al., 2021b, Ghosh et al., 2020, Alhindi and Ghosh, 2021 Wang et al., 2020). We will describe more details of that later on.
Use of pretrained models The use of pretrained language models has been popular among natural language processing community. This is because state-of-the-art models are pretrained on massive text corpus, allowing information learned from a huge amount of text to be used for downstream tasks.
There are two ways to use pretrained models. First, it can be used as feature extractor. For ex- ample, Putra et al. (2021b) has used BERT, a bi- directional Transformer based architecture, to ex- tract input embeddings, and then pass the contextu- alized embeddings to downstream networks. Sec- ond, the pretrained language models can be further fine-tuned for the task of interest. Ye and Teufel
(2021), Wang et al. (2020), Alhindi and Ghosh (2021) and Ghosh et al. (2020) used this approach. In addition, other than fine-tuning on the task data directly, Alhindi and Ghosh (2021) and Ghosh et al. (2020) experimented with continued pre-training with a large unlabelled domain relevant corpus first, and then fine-tuning with task data.

Address multiple sub-tasks

As mentioned, argument mining mostly concerns four sub-tasks and they are often addressed together. The naive way to solve them simultaneously is to model each sub-task separately in a pipeline fash- ion. This introduces at least two issues: first, it does not enforce any constraints between different tasks; second, the errors made early on in the pipeline could propagate.

One way to address these is to use integer linear programming (ILP) (Roth and Yih, 2004) for joint inference. Stab and Gurevych (2017a) and Persing and Ng (2016) adopted this approach and Persing and Ng (2016) further proposed an ILP objective that directly optimizes F score.

Another way to address these issues is to build a joint model that can model all these tasks at the same time. Eger et al. (2017) proposed two dif- ferent frameworks for joint modelling the full ar- gument mining task and has been influential. Fig- ure 2 showed an illustration of these two formu- lations. They first framed the task as sequence tagging. For example, they used S&G2017a data for experiments and in this case, each token’s label space would include 1) whether this token is a be- gin or continue of an argument component or it is non-argumentative); 2) the type of the component to which the token belongs; 3) distance between the corresponding component and the component it relates to; 4) the relation type between the two related components. This way they can use off- the-shelf taggers to solve all four sub-tasks at once. They also framed the task as dependency parsing. When framed as dependency parsing, the text is rep- resented as directed trees where each token has a labelled head so that argument component relation information can be encoded. They further labelled these edges with tokens’ component types and rela- tion types so that argument trees can be converted to quasi-dependency trees. Finally, they adapted a joint neural model designed for entity detection and relation extraction (Miwa and Bansal, 2016). In this case, they modelled argument components

Since [it killed many marine lives]Premise , [tourism has threatened nature]Claim .

Since	it	killed	many
(O, ∅, ∅, ∅)	(B, P, 1, Supp)	(I, P, 1, Supp)	(I, P, 1, Supp)
marine	lives	,	tourism
(I, P, 1, Supp)	(I, P, 1, Supp)	(O, ∅, ∅, ∅)	(B, C, ∅, For)
has	threatened	nature	.
(I, C, ∅, For)	(I, C, ∅, For)	(I, C, ∅, For)	(O, ∅, ∅, ∅)

1 2 3 4 5 6 7 8 9

1 1

0 1

(I, P, Supp)

(B, C, For)

(B, P, Supp)

Figure 2: Illustration of the formulations described in

3.3. On the top is an annotated sentence. The second chunk represents sequence tagging formulation and the bottom is dependency parsing formulation. In the fig- ure, B means the begin of a component, I means the continue of a component, C stands for Claim, P stands for Premise, O stands for non-argumentative, Supp

stands for Support and ∅ means not filled. Figure is adapted from Eger et al. (2017)

as entities and argument relations as semantic rela- tions.

Unsupervised learning
Aside from the supervised learning work described above, Persing and Ng (2020) proposed an unsu- pervised method for argument mining. The key for their work is to use heuristics for bootstrapping a small set of labels and then train the model in a self-training fashion.

Experiments
The use of Transformer-based models has been dominating in other natural language processing tasks but is still in its infant stage in this field. Therefore, we experiment with vanilla BERT mod- els on three representative datasets, hoping to facil- itate research to that end. The three datasets cover argumentation structure parsing and argument qual- ity assessment. Specifically, we used S&G2017a dataset, which is the benchmark for parsing es- say argumentation structure. Besides, we used S&G2017b dataset, which assess argument quality from a logic aspect (Wachsmuth et al., 2017) by annotating whether a piece of argument is suffi- ciently supported. We also used P&N2015 dataset which assess argument quality from a rhetoric as- pect (Wachsmuth et al., 2017) by assigning an over- all argument strength score to each essay. Statistics

All
Per Essay

All
Per Essay
Token
147,271
366.3

Token
649,549
649.5
Sentence
7,116
17.7

Sentence
31,589
31.6
Paragraph
1,833
4.6

Paragraph
7,537
7.5
Essay
402
1

Essay
1,000
1

Table 3: Data statistics for the whole S&G2017a dataset.

All
Per Argument
Token
97,370
94.6
Sentence
4,593
4.5
Argument
1,029
1

Number
Percentage
Sufficient
681
66.2%
Insufficient
348
33.8%
Table 4: Data statistics for S&G2017b dataset. It in- cludes the size of the corpus and class distributions.

for these datasets can be found in Table 3, 4, 5 and 6.
1. Implementation details
  S&G2017a For S&G2017a dataset, we followed Eger et al. (2017) and used 286 essays for training, 36 essays for validation and 80 essays for testing on paragraph level. For model building, we used Hug- gingface (Wolf et al., 2020) library’s cased base BERT model. We added one shared dropout layer and three linear heads for predicting labels men- tioned in 3.3. The loss is computed by summing up the cross entropy loss on each head. Note that BERT’s tokenizer can tokenize each word into mul- tiple sub-tokens but we want to predict only one set of tags for each word. To address this, we only use the first sub-token for training. For model training, we used AdamW optimizer (Loshchilov and Hut- ter, 2017), learning rate 3e-5 and Cosine Annealing Warm Restart scheduler (Loshchilov and Hutter, 2016) implemented by PyTorch Lightning library (Falcon et al., 2019). We also monitored validation loss for early stopping. We only tuned dropout rate and patience used for early stopping minimally and we set patience to 5 and dropout rate to 0.5.
  S&G2017b For S&G2017b dataset, we used 823 arguments for training, 103 arguments for vali- dation and 103 arguments for testing. We used the same BERT model and added a dropout layer and a linear layer. We used binary cross entropy loss as objective function. For model hyperparameters, we set patience to 5, dropout rate to 0.3 and kept
  Table 5: Data statistics for P&N2015 dataset.
  
  Score
  1.0
  1.5
  2.0
  2.5
  3.0
  3.5
  4.5
  Essays
  2
  21
  116
  342
  372
  132
  15
  Table 6: Score distribution for P&N2015 dataset.
  
  everything else the same as S&G2017a.
  P&N2015 For P&N2015 dataset, we used 800 essays for training, 100 essays for validation and 100 essays for testing. Additionally, we ran cross- prompt experiments by randomly selecting 247 es- says from prompt 1 for training, 31 essays for vali- dation, and rest of prompt 1 essays as well as essays from other prompts for testing. We also used one dropout layer and one linear layer on top of the base BERT model. For objective function, we used mean square loss. Besides, we set patience to 5 and dropout rate to 0.1 and kept everything else the same as S&G2017a.
2. Metrics and comparison methods
  2T P +FP +FN
  S&G2017a Similar to most work, we followed Persing and Ng (2016) for evaluation. They de- fined F1 score as F1 = 2T P , where TP stands for true positive, FP stands for false positive and FN stands for false negative. They also define ’level α matching’ (Eger et al., 2017): for α% level match, the predicted component span and ground truth share at least α% of their tokens. For com- parison methods, we chose the LSMT-ER model from Eger et al. (2017), which is a common base- line. In addition, we compared with the BiPAM model from Ye and Teufel (2021), which is a BERT- enhanced biaffine dependency parser (Dozat and Manning, 2018).
  S&G2017b We used macro F1 and accuracy scores for evaluation, following Stab and Gurevych (2017b). As for comparison methods, we chose the best performing CNN model in their work and human upper bound.
  P&N2015 We used mean absolute error (MAE) and mean square error (MSE) for evaluation. We compared the baseline with a model developed by Wachsmuth et al. (2016), which is the best perform- ing model to our knowledge.
  
  C-F1
  R-F1
  100% 50%
  100% 50%
  LSTM-ER
  70.8 77.2
  45.5 50.1
  BiPAM
  72.9 N/A
  45.9 N/A
  BERT+linear
  69.3 76.7
  43.7 47.6
  Table 7: Performance of LSTM-ER, BiPAM and vanilla BERT on S&G2017a dataset. The models are all trained on paragraph level and we report both 100% level match and 50% level match results. C-F1 stands for argument component F1 and R-F1 stands for argu- ment relation F1. Best scores are in bold.
  
  Accuracy
  Macro F1
  Human
  0.911±.022
  0.887±.026
  CNN
  0.843±.025
  0.827±.027
  BERT+linear
  0.882±.018
  0.869±.012
  Table 8: Performance of CNN, vanillaBERT and hu-
  form previous methods. There are two possible reasons for explaining it. First, the average token in an single essay exceed the maximum length(512) supported by vanilla BERT model. This results in part of the essay being truncated. From the data statistics, we can know that at least 25% of the essay is being truncated, resulting in large infor- mation loss. Second, both Persing and Ng (2015) and Wachsmuth et al. (2016) encoded argument structure information explicitly by crafting a list of argument structure related features. Mim et al. (2019b) incorporated paragraph’s argument func- tion information by pretraining on large-scale es- says with shuffled paragraphs. At the same time, BERT model was not pretrained on relevant task and might fail at capturing the overall argument structure of an essay.
  man upper bound on S&G2017b. Best scores are in bold.
3. Results and discussion
  S&G2017a The experiment results for this argu- ment mining task are shown in Table 7. The vanilla BERT did not outperform state-of-the-art methods but are close. The BiPAM model, which is a BERT- enhanced dependency parser, performs the best. These demonstrate the power of Transformer-based models for the argument mining task. Additionally, LSTM-ER has 1) designed separate module for handling components and relations; 2) encoded ex- plicit syntax information through syntactic parser. BiPAM has carefully designed the argument re- lation representation so that it can benefit from state-of-the-art dependency parsers. These con- siderations can be used for further unlocking the potential of Transformer-based models.
  S&G2017b The experiment results for suffi- ciency recognition are shown in Table 8. We can see that the vanilla BERT model outperforms the previous best performing CNN model by a large margin. This is expected as the BERT model has been pretrained on a huge amount of text. It is surprising that the vanilla BERT model already achieved near-human performance. Therefore, we foresee that Transformer-based model can equal or even surpass human upper bound on sufficiency recognition in the near future.
  P&N2015 The experiment results for argument strength scoring are shown in Table 9. On this dataset, the vanilla BERT model did not outper-
  
  MAE
  MSE
  Persing2015
  0.392
  0.244
  Wachsmuth2016
  0.378
  0.226
  Mim2019
  N/A
  0.231
  BERT+linear
  0.394
  0.250
  Table 9: Performance of best models from Persing and Ng (2015), Wachsmuth et al. (2016) and Mim et al. (2019b), and vanillaBERT on P&N2015.
  
  To further gauge the generalization ability of Transformer-based models, we ran the vanilla BERT in a cross-prompt setting and the results are shown in 10. Being able to generalize across prompt is valuable in practice because it is expen- sive to collect data for each new prompt. Recall that the model is trained and validated on essays of prompt 1. From 3, we can see that the MSE and MAE remains similar across prompts except for prompt 3 and prompt 5. We took a look at these two prompts and found that they are much more abstract and provide less concrete context than the other prompts. 1 We hypothesize it to be the per- formance drop on prompt 3 and prompt 5 and thus believe that BERT can generalize reasonably well.
  Overall, the vanilla BERT models perform com- parably to previous methods across datasets. We believe this can be a good baseline model for future research in the community.
  
  1Due to licensing of the ICLE dataset, we cannot provide the detailed information here.
  Test Prompt
  1
  2
  3
  4
  5
  6
  7
  8
  9
  10
  MAE
  MSE
  0.421
  0.237
  0.399
  0.268
  0.485
  0.349
  0.443
  0.286
  0.471
  0.403
  0.389
  0.225
  0.371
  0.217
  0.369
  0.246
  0.398
  0.251
  0.439
  0.283
  
  Table 10: Results of cross-prompt experiments on P&N2015 dataset
  
  0.5
  
  0.4
  
  0.3
  
  0.2
  
  MAE MSE
  
  1 2 3 4 5 6 7 8 9 10
  
  Figure 3: Line chart based on Table 10
  accessible datasets that target authors whose in- terested language skills are not proficient yet, to our knowledge. Specifically, Putra et al. (2021a) annotated essays written by English learners from various Asian countries. They discarded already well-written essays and only reserved the ones with intermediate quality. Alhindi and Ghosh (2021) annotated argumentative essays by middle school students and found these essays less structured and thus more challenging. We believe collecting more diverse datasets would be valuable because it will not only expand the impact of argumentative writ- ing support systems, but also pose more challeng-
Conclusion and Future Directions

In this paper we have organize existing works around tasks, data and methods regarding auto- mated evaluation for student argumentative writing. We provided a baseline to the community by experi- menting with BERT models on benchmark datasets. Now the question would be: what is important for the future? Towards this end, we identified several directions we deem interesting.

First of all, there is a lack of emphasis on lan- guage diversity in current datasets. This concerns two aspect: language of the essays and language backgrounds of the authors. For language of the essays, most datasets are annotated essays written in English. Horbach et al. (2017) and Wambsganss et al. (2020) annotated essays written in German and these two datasets are the only ones not written in English to our knowledge. Intuitively speak- ing, characteristics such as overall argumentation structure would vary across essays written in dif- ferent languages. Besides, annotators for essays written in different languages may have different criteria when assessing the rhetoric quality of an essay. As for authors’ language backgrounds, cur- rent datasets either do not take into account authors’ language levels or use essays written by proficient writers (e.g. college-level writing). 2 Putra et al. (2021a) and Alhindi and Ghosh (2021) are the only

2Although P&N2015 dataset annotated English learner’s texts, most of the authors’ native languages belong to Indo- European languages and have received at least 6 or more years’ English education

ing research problems.

Second, adoption of Transformer-based models is still in the infant stage despite some recent works have used Transformer-based architectures. As de- scribed in 4, we have built the simplest form of BERT models and demonstrated that they have comparable performance to previous state-of-the- art methods. In addition, Ghosh et al. (2020) and Alhindi and Ghosh (2021) showed that per- formance of BERT can be greatly improved by continued pre-training on unlabelled essay datasets or architectural design that takes account of data characteristics. Therefore, we believe there is still huge potential for Transformer-based models.

Third, there is no research regarding generaliza- tion in this field. We are aware of relevant research works in the AES area (Jin et al. 2018, Cao et al. 2020, Ridley et al. 2021) but not for systems that care about argument-related attributes. Not being able to generalize across prompts remains a ma- jor bottleneck in AES (Woods et al., 2017) and we believe this can also be a main obstacle for deploying automated evaluation systems specifi- cally for argumentative essays. This is because it is costly to collect data for every prompt. At the same time, learners usually need to practice writing over a large amount of prompts and get feedback before seeing significant improvement. This echoes our promotion for Transformer-based models as we have shown in 4.3 a vanilla BERT model can generalize well across prompts within one dataset. Besides, Transformer-based models

have shown to be effective on unseen domains in other NLP tasks (Houlsby et al. 2019, Han and Eisenstein 2019). Overall, we believe it is critical to build generalizeable systems and hope to see more research addressing this issue in the future.

References

Tariq Alhindi and Debanjan Ghosh. 2021. “sharks are not the threat humans are”: Argument component segmentation in school student essays. In Proceed- ings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 210– 222, Online. Association for Computational Linguis- tics.

Beata Beigman Klebanov, Binod Gyawali, and Yi Song. 2017. Detecting good arguments in a non-topic- specific way: An oxymoron? In Proceedings of the 55th Annual Meeting of the Association for Compu- tational Linguistics (Volume 2: Short Papers), pages 244–249, Vancouver, Canada. Association for Com- putational Linguistics.

Jill Burstein, Martin Chodorow, and Claudia Leacock. 2004. Automated essay evaluation: The criterion online writing service. Ai magazine, 25(3):27–27.

Yue Cao, Hanqi Jin, Xiaojun Wan, and Zhiwei Yu. 2020. Domain-adaptive neural automated essay scoring. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Develop- ment in Information Retrieval, pages 1011–1020.

Winston Carlile, Nishant Gurrapadi, Zixuan Ke, and Vincent Ng. 2018. Give me more feedback: Anno- tating argument persuasiveness and related attributes in student essays. In Proceedings of the 56th An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 621– 631, Melbourne, Australia. Association for Compu- tational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language under- standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Associ- ation for Computational Linguistics.

Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing. In Proceedings of the 56th Annual Meet- ing of the Association for Computational Linguis- tics (Volume 2: Short Papers), pages 484–490, Mel- bourne, Australia. Association for Computational Linguistics.

Steffen Eger, Johannes Daxenberger, and Iryna Gurevych. 2017. Neural end-to-end learning for

computational argumentation mining. In Proceed- ings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), volume Volume 1: Long Papers, pages (11–22). Association for Computational Linguistics.

Mohammad Falakmasir, Kevin Ashley, Christian Schunn, and Diane Litman. 2014. Identifying the- sis and conclusion statements in student essays to scaffold peer review.

William Falcon et al. 2019. Pytorch lightning. GitHub. Note: https://github.com/PyTorchLightning/pytorch- lightning, 3.

Debanjan Ghosh, Beata Beigman Klebanov, and Yi Song. 2020. An exploratory study of argumenta- tive writing by young students: A transformer-based approach. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 145–150, Seattle, WA, USA → Online. Association for Computational Linguistics.

Sylviane Granger, Estelle Dagneaux, Fanny Meunier, Magali Paquot, et al. 2009. International corpus of learner English. Presses universitaires de Louvain Louvain-la-Neuve.

Xiaochuang Han and Jacob Eisenstein. 2019. Unsu- pervised domain adaptation of contextualized em- beddings for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Nat- ural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4238–4248.

Andrea Horbach, Dirk Scholten-Akoun, Yuning Ding, and Torsten Zesch. 2017. Fine-grained essay scor- ing of a complex writing task for native speakers. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 357–366.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR.

Cancan Jin, Ben He, Kai Hui, and Le Sun. 2018. Tdnn: a two-stage deep neural network for prompt- independent automated essay scoring. In Proceed- ings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 1088–1097.

Ralph Henry Johnson and J Anthony Blair. 2006. Log- ical self-defense. Idea.

Zixuan Ke and Vincent Ng. 2019. Automated essay scoring: A survey of the state of the art. In IJCAI, volume 19, pages 6300–6308.

John Lawrence and Chris Reed. 2020. Argument mining: A survey. Computational Linguistics, 45(4):765–818.

Ilya Loshchilov and Frank Hutter. 2016. Sgdr: Stochas- tic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.

Ilya Loshchilov and Frank Hutter. 2017. Decou- pled weight decay regularization. arXiv preprint arXiv:1711.05101.

Sandeep Mathias and Pushpak Bhattacharyya. 2020. Can neural networks automatically score essay traits? In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 85–91, Seattle, WA, USA → On- line. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- rado, and Jeff Dean. 2013. Distributed representa- tions of words and phrases and their compositional- ity. In Advances in neural information processing systems, pages 3111–3119.

Farjana Sultana Mim, Naoya Inoue, Paul Reisert, Hi- roki Ouchi, and Kentaro Inui. 2019a. Unsupervised learning of discourse-aware text representation for essay scoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Lin- guistics: Student Research Workshop, pages 378– 385, Florence, Italy. Association for Computational Linguistics.

Farjana Sultana Mim, Naoya Inoue, Paul Reisert, Hi- roki Ouchi, and Kentaro Inui. 2019b. Unsupervised learning of discourse-aware text representation for essay scoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Lin- guistics: Student Research Workshop, pages 378– 385, Florence, Italy. Association for Computational Linguistics.

Makoto Miwa and Mohit Bansal. 2016. End-to-end re- lation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1116, Berlin, Germany. Association for Computational Linguis- tics.

Ellis B Page. 1966. The imminence of grading essays by computer. The Phi Delta Kappan, 47(5):238– 243.

G Parra et al. 2019. Automated writing evaluation tools in the improvement of the writing skill. Interna- tional Journal of Instruction, 12(2):209–226.

Isaac Persing and Vincent Ng. 2015. Modeling argu- ment strength in student essays. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 543–552, Beijing, China. Association for Computational Linguistics.

Isaac Persing and Vincent Ng. 2016. End-to-end ar- gumentation mining in student essays. In Proceed- ings of the 2016 Conference of the North Ameri- can Chapter of the Association for Computational

Linguistics: Human Language Technologies, pages 1384–1394, San Diego, California. Association for Computational Linguistics.

Isaac Persing and Vincent Ng. 2020. Unsupervised ar- gumentation mining in student essays. In Proceed- ings of the 12th Language Resources and Evaluation Conference, pages 6795–6803, Marseille, France. European Language Resources Association.

Jan Wira Gotama Putra, Simone Teufel, and Takenobu Tokunaga. 2021a. Annotating argumentative struc- ture in english-as-a-foreign-language learner essays. Natural Language Engineering, pages 1–27.

Jan Wira Gotama Putra, Simone Teufel, and Takenobu Tokunaga. 2021b. Parsing argumentative structure in English-as-foreign-language essays. In Proceed- ings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 97– 109, Online. Association for Computational Linguis- tics.

Robert Ridley, Liang He, Xin-yu Dai, Shujian Huang, and Jiajun Chen. 2021. Automated cross-prompt scoring of essay traits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13745–13753.

Dan Roth and Wen-tau Yih. 2004. A linear program- ming formulation for global inference in natural lan- guage tasks. Technical report, ILLINOIS UNIV AT URBANA-CHAMPAIGN DEPT OF COMPUTER SCIENCE.

Lawrence M Rudner, Veronica Garcia, and Catherine Welch. 2006. An evaluation of intellimetric™ essay scoring system. The Journal of Technology, Learn- ing and Assessment, 4(4).

Mark D Shermis and Jill Burstein. 2013. Handbook of automated essay evaluation: Current applications and new directions. Routledge.

Swapna Somasundaran, Jill Burstein, and Martin Chodorow. 2014. Lexical chaining for measur- ing discourse coherence quality in test-taker essays. In Proceedings of COLING 2014, the 25th Inter- national conference on computational linguistics: Technical papers, pages 950–961.

Christian Stab and Iryna Gurevych. 2014a. Annotating argument components and relations in persuasive es- says. In Proceedings of COLING 2014, the 25th In- ternational Conference on Computational Linguis- tics: Technical Papers, pages 1501–1510, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Christian Stab and Iryna Gurevych. 2014b. Identify- ing argumentative discourse structures in persuasive essays. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 46–56, Doha, Qatar. Association for Computational Linguistics.

Christian Stab and Iryna Gurevych. 2016. Recogniz- ing the absence of opposing arguments in persua- sive essays. In Proceedings of the Third Work- shop on Argument Mining (ArgMining2016), pages 113–118, Berlin, Germany. Association for Compu- tational Linguistics.

Christian Stab and Iryna Gurevych. 2017a. Parsing ar- gumentation structures in persuasive essays. Com- putational Linguistics, 43(3):619–659.

Christian Stab and Iryna Gurevych. 2017b. Recogniz- ing insufficiently supported arguments in argumen- tative essays. In Proceedings of the 15th Confer- ence of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Pa- pers, pages 980–990, Valencia, Spain. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information pro- cessing systems, pages 5998–6008.

Henning Wachsmuth, Khalid Al-Khatib, and Benno Stein. 2016. Using argument mining to assess the argumentation quality of essays. In Proceedings of COLING 2016, the 26th International Confer- ence on Computational Linguistics: Technical Pa- pers, pages 1680–1691, Osaka, Japan. The COLING 2016 Organizing Committee.

Henning Wachsmuth, Nona Naderi, Yufang Hou, Yonatan Bilu, Vinodkumar Prabhakaran, Tim Al- berdingk Thijm, Graeme Hirst, and Benno Stein. 2017. Computational argumentation quality assess- ment in natural language. In Proceedings of the 15th Conference of the European Chapter of the Associa- tion for Computational Linguistics: Volume 1, Long Papers, pages 176–187, Valencia, Spain. Associa- tion for Computational Linguistics.

Thiemo Wambsganss, Christina Niklaus, Matthias Söll- ner, Siegfried Handschuh, and Jan Marco Leimeister. 2020. A corpus for argumentative writing support in german. arXiv preprint arXiv:2010.13674.

Qian Wan, Crossley Scott, Banawan Michelle, Balyan Renu, Tian Yu, McNamara Danielle, and Allen Laura. 2021. Automated claim identification using nlp features in student argumentative essays. Educa- tional Data Mining.

Hao Wang, Zhen Huang, Yong Dou, and Yu Hong. 2020. Argumentation mining on essays at multi scales. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5480–5493, Barcelona, Spain (Online). Interna- tional Committee on Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Rémi Louf, Morgan Funtow- icz, Joe Davison, Sam Shleifer, Patrick von Platen,

Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language pro- cessing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Asso- ciation for Computational Linguistics.

Bronwyn Woods, David Adamson, Shayne Miel, and Elijah Mayfield. 2017. Formative essay feedback using predictive scoring models. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 2071– 2080.

Linting Xue and Collin F. Lynch. 2020. Incorporating task-specific features into deep models to classify ar- gument components. Educational Data Mining.

Yuxiao Ye and Simone Teufel. 2021. End-to-end ar- gument mining as biaffine dependency parsing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Lin- guistics: Main Volume, pages 669–678, Online. As- sociation for Computational Linguistics.

	All	Per Essay		All	Per Essay
Token	147,271	366.3	Token	649,549	649.5
Sentence	7,116	17.7	Sentence	31,589	31.6
Paragraph	1,833	4.6	Paragraph	7,537	7.5
Essay	402	1	Essay	1,000	1

	All	Per Argument
Token	97,370	94.6
Sentence	4,593	4.5
Argument	1,029	1
	Number	Percentage
Sufficient	681	66.2%
Insufficient	348	33.8%

	C-F1	R-F1
	100% 50%	100% 50%
LSTM-ER	70.8 77.2	45.5 50.1
BiPAM	72.9 N/A	45.9 N/A
BERT+linear	69.3 76.7	43.7 47.6

	Accuracy	Macro F1
Human	0.911±.022	0.887±.026
CNN	0.843±.025	0.827±.027
BERT+linear	0.882±.018	0.869±.012

	MAE	MSE
Persing2015	0.392	0.244
Wachsmuth2016	0.378	0.226
Mim2019	N/A	0.231
BERT+linear	0.394	0.250

Score	1.0	1.5	2.0	2.5	3.0	3.5	4.5
Essays	2	21	116	342	372	132	15

Automated Evaluation for Student Argumentative Writing: A Survey

Xinyu Wang

Yohan Lee

Juneyoung Park

Abstract

Introduction

Tasks and Datasets

Argument Mining

Component Detection

Quality Assessment

Methods

Feature-based method

Neural method

Use of pretrained models The use of pretrained language models has been popular among natural language processing community. This is because state-of-the-art models are pretrained on massive text corpus, allowing information learned from a huge amount of text to be used for downstream tasks.

Address multiple sub-tasks

Unsupervised learning

Experiments

Implementation details

Metrics and comparison methods

Results and discussion

P&N2015 The experiment results for argument strength scoring are shown in Table 9. On this dataset, the vanilla BERT model did not outper-

Conclusion and Future Directions

References